Topic Significance Ranking of LDA Generative Models

Authors

  • Loulwah AlSumait
  • Daniel Barbará
  • James Gentle
  • Carlotta Domeniconi
Abstract

Topic models such as Latent Dirichlet Allocation (LDA) have recently been used to automatically generate the topics of text corpora and to subdivide the corpus words among those topics. However, not all estimated topics are of equal importance or correspond to genuine themes of the domain. Some topics may be collections of irrelevant or background words, or may represent insignificant themes. Current approaches to topic modeling rely on manual examination of the output to find meaningful and important topics. This paper presents the first automated unsupervised analysis of LDA models to identify and distinguish junk topics from legitimate ones, and to rank topic significance. The basic idea is to measure the distance between a topic distribution and a "junk distribution". In particular, three definitions of "junk distribution" are introduced, and a variety of metrics are used to compute the distances, from which an expressive figure of topic significance is derived using a 4-phase Weighted Combination approach. Our experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in ranking the significance of topics.
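
The distance idea in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the three junk distributions, the metrics, or the weighted combination, so this sketch assumes a single junk distribution (uniform over the vocabulary) and a single metric (KL divergence). The function names and toy topics are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rank_by_distance_from_uniform(topics):
    """Score each topic-word distribution by its KL divergence from uniform.

    ASSUMPTION: the uniform distribution over words stands in for a
    "junk distribution"; the source does not say which definitions it uses.
    Returns (indices sorted from most to least distant, raw scores).
    """
    scores = []
    for topic in topics:
        uniform = [1.0 / len(topic)] * len(topic)
        scores.append(kl_divergence(topic, uniform))
    order = sorted(range(len(topics)), key=lambda k: scores[k], reverse=True)
    return order, scores

# Hypothetical topic-word distributions over a 4-word vocabulary.
toy_topics = [
    [0.25, 0.25, 0.25, 0.25],  # flat: identical to the assumed junk distribution
    [0.70, 0.10, 0.10, 0.10],  # moderately peaked
    [0.97, 0.01, 0.01, 0.01],  # strongly peaked
]
order, scores = rank_by_distance_from_uniform(toy_topics)
```

Under these assumptions, a topic that spreads its probability evenly over the vocabulary scores zero and lands at the bottom of the ranking, while topics concentrated on a few words score higher. A fuller treatment would combine several junk definitions and several metrics, as the abstract describes.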


Similar resources

Topic Discovery based on LDA_col Model and Topic Significance Re-ranking

This paper presents a method for finding topics efficiently by combining topic discovery with topic re-ranking. Most topic models rely on the bag-of-words (BOW) assumption. Our approach applies an extension of the LDA model, Latent Dirichlet Allocation_Collocation (LDA_col), to the corpus so that word order can be taken into consideration for phrase discovery, and slightly modify the m...


LDA Based Similarity Modeling for Question Answering

We present an exploration of generative modeling for the question answering (QA) task to rank candidate passages. We investigate Latent Dirichlet Allocation (LDA) models to obtain ranking scores based on a novel similarity measure between a natural language question posed by the user and a candidate passage. We construct two models, each introducing deeper evaluations on latent characteristi...


Study of entity-topic models for OOV proper name retrieval

Retrieving Proper Names (PNs) relevant to an audio document can improve speech recognition and content-based audio-video indexing. The Latent Dirichlet Allocation (LDA) topic model has been used to retrieve Out-Of-Vocabulary (OOV) PNs relevant to an audio document with good recall rates. However, retrieval of OOV PNs using LDA is affected by two issues, which we study in this paper: (1) Word Freque...


Modeling and Leveraging Social Collective Intelligence

The rise of social interactions on the Web requires developing new methods of information organization and discovery. To that end, we propose a generative community-based probabilistic tagging model that can automatically uncover communities of users and their associated tags. We experimentally validate the quality of the discovered communities over the social bookmarking system Delicious. In c...


A probabilistic topic model based on local word relationships in overlapping windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, extracting topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...




Publication date: 2009